This report explores a dataset of 1599 red wine observations, each with a quality score and 11 other features.
## [1] 1599 12
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
## fixed.acidity volatile.acidity citric.acid
## 0 0 0
## residual.sugar chlorides free.sulfur.dioxide
## 0 0 0
## total.sulfur.dioxide density pH
## 0 0 0
## sulphates alcohol quality
## 0 0 0
The counts above give the number of NA values in each column of the data frame. None are missing.
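A per-column NA count like the one above can be produced in base R roughly as follows. This is a minimal sketch: `wine_toy` is a hypothetical stand-in data frame (with one NA added on purpose), not the actual wine data.

```r
# Hypothetical toy data; the NA in `alcohol` is deliberate so the check
# has something to find.
wine_toy <- data.frame(
  alcohol   = c(9.4, 9.8, NA, 10.5),
  sulphates = c(0.56, 0.68, 0.65, 0.58),
  quality   = c(5L, 5L, 6L, 7L)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(wine_toy))
print(na_counts)
```

On the real data every count would be zero, as in the output above.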
## [,1]
## fixed.acidity 0.12405165
## volatile.acidity -0.39055778
## citric.acid 0.22637251
## residual.sugar 0.01373164
## chlorides -0.12890656
## free.sulfur.dioxide -0.05065606
## total.sulfur.dioxide -0.18510029
## density -0.17491923
## pH -0.05773139
## sulphates 0.25139708
## alcohol 0.47616632
The correlation of each variable with quality. Four attributes have a weak to moderate correlation (negative or positive) with quality: volatile.acidity, citric.acid, sulphates, and alcohol.
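A one-column table of correlations against quality can be computed by passing the feature columns and the quality vector to `cor()`. The sketch below uses only the first ten rows shown in the structure listing at the top of the report (three features plus quality), so its numbers will not match the full-data values; `wine_head` is a hypothetical name.

```r
# First ten observations as printed by str() earlier in the report.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Every column except quality is treated as a feature.
features <- wine_head[, names(wine_head) != "quality"]

# Pearson correlation of each feature with quality; the result is a
# one-column matrix with one row per feature.
cor_with_quality <- cor(features, wine_head$quality)
print(cor_with_quality)
```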
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## quality 0.12405165 -0.390557780 0.22637251
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## citric.acid 0.143577162 0.203822914 -0.060978129
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## alcohol 0.042075437 -0.221140545 -0.069408354
## quality 0.013731637 -0.128906560 -0.050656057
## total.sulfur.dioxide density pH
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## alcohol -0.20565394 -0.49617977 0.20563251
## quality -0.18510029 -0.17491923 -0.05773139
## sulphates alcohol quality
## fixed.acidity 0.183005664 -0.06166827 0.12405165
## volatile.acidity -0.260986685 -0.20228803 -0.39055778
## citric.acid 0.312770044 0.10990325 0.22637251
## residual.sugar 0.005527121 0.04207544 0.01373164
## chlorides 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide 0.042946836 -0.20565394 -0.18510029
## density 0.148506412 -0.49617977 -0.17491923
## pH -0.196647602 0.20563251 -0.05773139
## sulphates 1.000000000 0.09359475 0.25139708
## alcohol 0.093594750 1.00000000 0.47616632
## quality 0.251397079 0.47616632 1.00000000
Correlations between all variables.
A quick matrix chart showing some of the relationships between variables.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
The quality scores are roughly normally distributed. Although the scale runs from 0 to 10, no wine scored below 3 or above 8, and most scored 5 or 6.
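A frequency table makes the concentration of scores easier to check than the five-number summary. The sketch below uses only the ten quality values visible in the structure listing at the top of the report, so the shares are illustrative and are not the full-data figures; the variable names are hypothetical.

```r
# First ten quality scores as printed by str() earlier in the report.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Fixing the levels to 0..10 keeps unused scores visible as zero counts.
counts <- table(factor(quality, levels = 0:10))
share  <- prop.table(counts)

# Share of wines scoring 5 or 6 in this small sample.
share_5_or_6 <- sum(share[c("5", "6")])
print(counts)
print(share_5_or_6)
```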
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## [1] 0.9809084
Since the fixed acidity is positively skewed, we’ll try some transforms.
Fixed acidity appears positively skewed in all charts, but the log transformation gives the closest-to-normal distribution.
## [1] 0.1142376
The log-transformed attribute correlates with quality slightly worse (0.114) than the untransformed attribute (0.124).
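The skewness check and the before/after correlation comparison can be sketched as below. Two assumptions are worth flagging: the report does not say which function produced its skewness figures, so `skewness()` here is a hand-written third standardized moment, and the natural log is used (the log base does not affect the Pearson correlation). The data are the first ten rows from the structure listing, so the results will differ from the full-data values.

```r
# Moment-based skewness: third central moment over the second central
# moment raised to 3/2 (an assumption, not necessarily the report's function).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# First ten observations as printed by str() earlier in the report.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
quality       <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Skewness before and after the log transform.
skew_raw <- skewness(fixed_acidity)
skew_log <- skewness(log(fixed_acidity))

# Correlation with quality before and after the log transform.
cor_raw <- cor(fixed_acidity, quality)
cor_log <- cor(log(fixed_acidity), quality)

print(c(skew_raw = skew_raw, skew_log = skew_log))
print(c(cor_raw = cor_raw, cor_log = cor_log))
```

A transform that reduces skewness does not necessarily improve the linear correlation with quality, which is the pattern the report observes for fixed acidity.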
The very slight positive correlation can be seen in the trendline.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## [1] 0.6703331
Volatile acidity is positively skewed.
The fourth root (square root of the square root) appears to give the closest-to-normal distribution.
## [1] -0.3934108
But it barely strengthens the correlation (-0.393 versus -0.391 untransformed).
The negative correlation is obvious in the trendline.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## [1] 0.3177403
Citric acid appears to be positively skewed, but the jump around 0.5 reduces the skewness measure.
Of all the transforms, the square root gives the closest-to-normal distribution, but it still leaves a large number of wines with almost no citric acid.
## [1] 0.2066822
And the square root actually lowers the correlation (0.207 versus 0.226).
The weak positive correlation can be seen in the trendline.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
## [1] 4.53214
Residual sugar has an extreme positive skewness.
The reciprocal transformation appears to bring residual sugar closest to a normal distribution.
## [1] -0.02898281
This roughly doubles the magnitude of the correlation (from 0.014 to -0.029) and flips its sign, but the correlation remains negligible.
The trendline is almost completely flat, showing no real correlation.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## [1] 5.669694
Chlorides also have an extreme positive skew.
The fourth root (square root of the square root) brings chlorides closest to a normal distribution.
## [1] -0.1656209
The fourth root gives only a slightly stronger correlation with quality (-0.166 versus -0.129).
The negative trendline shows the slight correlation.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
## [1] 1.248222
Free sulfur dioxide has a positive skewness.
Log transform creates the nicest normal distribution.
## [1] -0.05008749
But the correlation remains almost unchanged.
The fairly flat trendline shows the lack of correlation.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
## [1] 1.512689
Total sulfur dioxide has a positive skewness.
Log transform creates a fairly normal distribution.
## [1] -0.1701427
The extremely weak correlation between total sulfur dioxide and quality can be seen in the trendline.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
## [1] 0.07115397
Density shows a fairly normal distribution so no transforms will be performed.
The slight negative correlation can be seen in the trendline.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
## [1] 0.1933203
pH shows a normal distribution.
The trendline shows almost no correlation.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
## [1] 2.424118
Sulphates are highly positively skewed.
The reciprocal transform gives the closest-to-normal distribution.
## [1] -0.3403317
This actually strengthened the correlation (from 0.251 to -0.340); the sign flips because the reciprocal reverses the ordering of the values. We’ll explore both options.
The reciprocal sulphates show a much stronger negative trend with quality.
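The sign change under a reciprocal transform follows from the fact that 1/x reverses the ordering of positive values. A rank-based (Spearman) correlation shows this exactly: it is negated, with its magnitude unchanged. The Pearson correlation used in this report can change in magnitude as well. The sketch below illustrates both on the first ten rows from the structure listing, so the values will not match the full-data figures; the variable names are hypothetical.

```r
# First ten observations as printed by str() earlier in the report.
sulphates <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8)
quality   <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Rank-based correlation: the reciprocal reverses the ranks of positive
# values, so the coefficient is negated exactly.
rho_raw   <- cor(sulphates, quality, method = "spearman")
rho_recip <- cor(1 / sulphates, quality, method = "spearman")

# Pearson correlation: the magnitude may also change, because the
# reciprocal is a nonlinear transform.
r_raw   <- cor(sulphates, quality)
r_recip <- cor(1 / sulphates, quality)

print(c(rho_raw = rho_raw, rho_recip = rho_recip))
print(c(r_raw = r_raw, r_recip = r_recip))
```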
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
## [1] 0.8592144
Alcohol has a positive skew.
The square root transform gives the closest-to-normal distribution.
## [1] 0.4768205
No real change in correlation (0.477 versus 0.476), but we’ll plot both.
They both appear to have the strongest trendlines we’ve seen with quality.
Now we’ll see how the four attributes with the highest correlations with quality correlate with each other.
## [1] -0.5524957
Pretty strong negative correlation between volatile acidity and citric acid, but both of those attributes could be correlated merely because they are acids in the wine.
## [1] -0.2609867
Weak correlation between volatile acidity and sulphates.
## [1] -0.202288
Weak correlation between volatile acidity and alcohol.
## [1] 0.31277
Medium correlation between citric acid and sulphates.
## [1] 0.1099032
Very little correlation between citric acid and alcohol.
## [1] 0.09359475
Very little correlation between sulphates and alcohol.
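The six pairwise checks above can also be computed in one pass by taking the correlation matrix of the four attributes and listing its upper triangle. This is a sketch on the first ten rows from the structure listing, so the values will differ from the full-data correlations; `top4` and `pair_table` are hypothetical names.

```r
# First ten observations as printed by str() earlier in the report.
top4 <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Symmetric 4 x 4 matrix of Pearson correlations.
pairwise <- cor(top4)

# Each unordered pair appears once in the upper triangle.
idx <- which(upper.tri(pairwise), arr.ind = TRUE)
pair_table <- data.frame(
  a = rownames(pairwise)[idx[, "row"]],
  b = colnames(pairwise)[idx[, "col"]],
  r = pairwise[idx]
)
print(pair_table)
```

Sorting `pair_table` by `abs(r)` would put the most redundant pair first, which is the property the comparison is meant to surface.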
## Description One

Alcohol had the highest positive correlation with wine quality (0.476). This makes sense, as one of the primary reasons to have an alcoholic beverage in the first place is the alcohol. Among wines with a quality of around 7, the vast majority have an alcohol content above 10%.

## Description Two

Sulphates had the second highest positive correlation with quality (0.251). Sulphates are wine additives that act as antimicrobial and antioxidant agents. They preserve the wine, so a higher sulphate level may make it less likely that a tasted wine had gone bad.

## Description Three

This shows little correlation (0.094) between sulphates and alcohol, the two attributes with the highest positive correlations with wine quality in the dataset. Since we’re ultimately trying to find the attributes that influence the quality of wine, and possibly to predict quality from them, it’s important that the features are not redundant. Redundant attributes add little new information and can lead to a model that overfits.
Because wine quality is effectively a categorical value, with most wines scoring 5 or 6, those two scores dominated the appearance of the graphs against quality. I was hoping that some of the transforms would give a higher correlation with quality than the untransformed attribute, but I didn’t see any real evidence of this with the transforms I tried.
Some limitations are due to the volume of data. At 1599 records, this is not a large dataset; perhaps I should have chosen the white wines instead. To investigate the data further, I would like to see a larger set. In addition, while the quality measure was the median score from at least three wine experts, I would also like to see the mean, which would give a more continuous quality measure.